In this project we analyzed Twitter data about SCMP with the goal of understanding its social media network and proposing strategies to support its worldwide growth. The project was developed as part of the Social Media course assessment in the Master of Business Analytics programme at Hong Kong University.
To extract the data from Twitter, we used the rtweet package with the search term “SCMPNews”. The data was extracted three times between December 2018 and January 2019:
- First extraction: 18 December 2018
- Second extraction: 25 December 2018
- Third extraction: 1 January 2019
The resulting data was merged and duplicated tweets were excluded. In addition, three files were derived from the original extract for use in the subsequent analysis:
- Vertex data: information about each user account, such as number of friends, number of tweets, account language and others.
- Edge data: users’ relationships through tweets, such as retweet, reply, mention and others.
- Tweet data: the original file extracted using the Twitter API.
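The merge-and-deduplicate step can be sketched as follows; the round objects below are toy stand-ins for the three rtweet extractions, with status_id as the unique tweet identifier:

```r
library(dplyr)

# Toy stand-ins for the three extraction rounds (the real data came from rtweet);
# tweets captured in more than one round share the same status_id.
round1 <- tibble(status_id = c("1", "2"), text = c("a", "b"))
round2 <- tibble(status_id = c("2", "3"), text = c("b", "c"))
round3 <- tibble(status_id = c("3", "4"), text = c("c", "d"))

# stack the rounds and keep one row per unique tweet
tweets <- bind_rows(round1, round2, round3) %>%
  distinct(status_id, .keep_all = TRUE)

nrow(tweets)  # 4 unique tweets remain
```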
Our final data contains 25,372 unique tweets, 13,267 users and 50,157 relationships. Only tweets in English were considered for this analysis.
On average there are 1,103 tweets per day in the data, but we can clearly see that on 18 December some event made the number of tweets increase dramatically. After some analysis we found that the trending topic that day was the “USA and China trade war”. We will show this in detail later on.
# count tweets per day
time <- tweet %>%
  mutate(day = as.Date(cut(created_at, breaks = "day"))) %>%
  group_by(day) %>%
  summarise(total = n())
# complete the date sequence so days with no tweets still appear
time_aux <- tibble::tibble(
  time = seq(as.Date("2018-12-10"), as.Date("2019-01-02"), by = "day"))
time <- left_join(time_aux, time, by = c("time" = "day"))
plot_ly(time, x = ~time, y = ~total) %>%
  add_lines() %>%
  layout(title = "Tweets per day") %>%
  rangeslider(time_aux$time[1], time_aux$time[5])
This data is characterized by a few users with a very high level of activity and a majority with only sporadic activity. These outliers are generally company accounts (like the South China Morning Post itself) and some well-known people who advocate for certain causes or are interested in spreading certain types of information.
vertice2 <- vertice %>% mutate(account_lang_en = ifelse(account_lang=="en","english","others"))
p <- plot_ly(vertice2,
x = ~followers_count,
y = ~friends_count,
z = ~statuses_count,
color = ~account_lang_en,
colors = c('#00004d', '#e6b800'),
text = ~paste('#Followers',followers_count,'<br>#Friends:',friends_count,'<br>#Statuses:',statuses_count)) %>%
layout(title = 'Followers vs Friends vs Statuses',
scene =
list(xaxis =
list(title = 'Followers',
gridcolor = 'rgb(230, 230, 230)',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
yaxis =
list(title = 'Friends',
gridcolor = 'rgb(230, 230, 230)',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
zaxis =
list(title = 'Statuses',
gridcolor = 'rgb(230, 230, 230)',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2)),
annotations =
list(x = 1.13,
y = 1.05,
text = 'Account Language',
showarrow = FALSE),
paper_bgcolor = 'rgb(255, 255, 255)',
plot_bgcolor = 'rgb(255, 255, 255)')
p
Besides that, SCMP behaves differently from other users. For example, most of its outgoing activity (“from”) is classified as “tweet” in the edge data, while for other users the principal activity is “mention”. This is because most of SCMP’s activity consists of posting links to its news articles together with short comments. On the other hand, when we analyze relationships “to” SCMP (that is, other users interacting with SCMP), the most frequent type is “retweet”, while for other users the principal category is, again, mention.
Finally, we see a lack of hashtag usage in the SCMP network. This is because SCMP itself rarely uses hashtags in its tweets and, since most of the time people are retweeting its posts, this behavior is amplified throughout the network.
p0 <- tweet %>%
mutate(n_hashtag = case_when( hashtag_count == 0 ~ "a. 0"
,hashtag_count == 1 ~ "b. 1"
,hashtag_count == 2 ~ "c. 2"
,hashtag_count == 3 ~ "d. 3"
,hashtag_count >= 4 ~ "e. >3")) %>%
ggplot(aes(x= as.factor(n_hashtag), fill=as.factor(n_hashtag))) +
geom_bar() +
geom_text(aes(label=..count..),stat="count", size = 3.5, position = position_dodge(0.7), vjust = 1, colour = "gray60") +
scale_fill_manual(values = c("midnightblue", "dodgerblue3","slategray3", "darkgoldenrod1", "lightgoldenrod1")) +
labs(title = 'Hashtags per tweet', x = 'Number of hashtags', y = 'Number of tweets') +
theme_SCMP
p0 <- p0 + theme(legend.position = "none")
p0 <- p0 + theme(plot.title = element_text(size = 11, family = "sans", color = "midnightblue", hjust = 0.5))
teste_scmp <- edge %>% mutate(flag_scmp1 = ifelse(vertice1 == "SCMPNews","1.SCMP","2.Others"),
flag_scmp2 = ifelse(vertice2 == "SCMPNews","1.SCMP","2.Others"))
p1 <- ggplot(teste_scmp, aes(x= relationship)) +
geom_bar(aes(fill=as.factor(flag_scmp1)),position = "dodge") +
scale_fill_manual(values = c("darkgoldenrod1", "midnightblue")) +
labs(title = 'From Node - Relationship Analysis', x = '', y = "") +
theme_SCMP
p1<- p1 + theme(legend.title=element_blank())
p1 <- p1 + theme(plot.title = element_text(size = 11, family = "sans", color = "midnightblue", hjust = 0.5))
p2 <- ggplot(teste_scmp, aes(x= relationship)) +
geom_bar(aes(fill=as.factor(flag_scmp2)),position = "dodge") +
scale_fill_manual(values = c("darkgoldenrod1", "midnightblue")) +
labs(title = 'To Node - Relationship Analysis', x = '', y = "") +
theme_SCMP
p2 <- p2 + theme(legend.title=element_blank())
p2 <- p2 + theme(plot.title = element_text(size = 11, family = "sans", color = "midnightblue", hjust = 0.5))
p3 <- ggplot(teste_scmp, aes(x= relationship)) +
geom_bar(fill="midnightblue") +
labs(title = 'Relationship Analysis', x = ' ', y="") +
theme_SCMP
p3 <- p3 + theme(plot.title = element_text(size = 11, family = "sans", color = "midnightblue", hjust = 0.5))
ggarrange(p0, p1, p2, p3, nrow=2, ncol=2, common.legend = FALSE)
Analyzing the word cloud, we clearly see some topics that were quite popular during the period of our analysis, such as “Trade War”, “Huawei” and “Xi” (Xi Jinping).
aux <- tweet %>% select(status_id, text, created_at)
aux$stripped_text <- gsub("http.*","", aux$text)
aux$stripped_text <- gsub("https.*","", aux$stripped_text)
aux$stripped_text <- gsub('[^\x20-\x7E]', '', aux$stripped_text)
words <- aux %>%
unnest_tokens(word, stripped_text) %>%
anti_join(stop_words)
words %>%
filter(word != "scmpnews") %>%
count(word) %>%
with(wordcloud(word, n, max.words = 120, rot.per = 0.4, scale=c(4,0.6),
colors=brewer.pal(7, "Set1")))
As the SCMP network is huge, with a lot of activity in different contexts, we decided to study its strategy per topic, that is, by the content of the tweets. To achieve this we used Natural Language Processing techniques to identify these topics and find groups of common content. We chose LDA (Latent Dirichlet Allocation) for this analysis and found 6 different groups in our data:
- “USA and China”: news related to the USA–China relationship, with special attention to the trade war, which was widely discussed in this period; that is why these words are bigger than the others.
- “International”: news related to the international scene. In this part of the cloud we can see the names of countries like “Canada”, “Japan” and “India”. The principal news in this topic relates to the arrest of the Huawei founder’s daughter in Canada.
- “Hong Kong News”: many different news items, all of them happening in or related to Hong Kong.
- “Mainland China”: news from inside Mainland China. Among the popular news in this topic are the stories about the group of Christians arrested during this period.
- “Sports”: here, the news that generated the most engagement concerned rumors of a boycott of the Japan Olympic Games because of the new Japanese policy on whale hunting.
- “Business”: general business news. In this cloud we can see terms like “Jack Ma”, “CEOs”, “wrapping” and “Christmas”.
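The topic model itself is not shown in the extracted code; a minimal sketch with the topicmodels package, using its bundled AssociatedPress document-term matrix as a stand-in for one built from the cleaned tweet text, could look like:

```r
library(topicmodels)

# Stand-in document-term matrix; in the project this would be built
# from the cleaned tweet text (e.g. via tidytext::cast_dtm)
data("AssociatedPress", package = "topicmodels")

# fit LDA with k = 6, matching the six groups found in the tweets
lda <- LDA(AssociatedPress[1:50, ], k = 6, control = list(seed = 123))
terms(lda, 5)  # top 5 terms per topic
```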
aux <- tweet %>% select(status_id, text, created_at, topic)
aux$stripped_text <- gsub("http.*","", aux$text)
aux$stripped_text <- gsub("https.*","", aux$stripped_text)
aux$stripped_text <- gsub('[^\x20-\x7E]', '', aux$stripped_text)
# concatenate all tweet text per topic into one document each
all <- sapply(1:6, function(i)
  paste(aux$stripped_text[aux$topic == i], collapse = " "))
all = removeWords(all, c(stopwords("english")))
corpus = Corpus(VectorSource(all))
tdm = TermDocumentMatrix(corpus)
# convert to a plain matrix for comparison.cloud
tdm <- as.matrix(tdm)
# add column names
colnames(tdm) = c("HK News", "International", "USA and China", "Business", "Sports", "Mainland China")
#, "topic7", "topic8", "topic9")
comparison.cloud(tdm, random.order=FALSE,scale=c(4,0.4),
colors = c("midnightblue", "darkgoldenrod1", "goldenrod4", "dodgerblue3", "gray44", "darkorange3"),
title.size=0.8,
max.words=5000)
To help understand the behavior of each group of users inside the sub-topic networks, we also clustered the users, to understand the different kinds of users in our data and how they interact with the different topics. For this task we used the k-means algorithm with 9 clusters.
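The clustering step is not included in the extracted code; a minimal k-means sketch on hypothetical per-user activity features (scaled first, since k-means is distance-based) might look like:

```r
# Hypothetical per-user features; the project used 9 clusters on the full data,
# while a toy set of 9 users with k = 3 is used here just to illustrate the call
set.seed(42)
features <- scale(data.frame(
  followers = c(10, 5000, 12, 300, 80, 2, 40000, 7, 150),
  statuses  = c(100, 90000, 50, 2000, 400, 10, 120000, 30, 900)
))
km <- kmeans(features, centers = 3)
km$cluster  # cluster assignment per user
```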
Another interesting topic in social media is sentiment analysis; especially in the context of news, it helps us understand how users react to each topic. It can also help us understand how opinion spreads along the relationships through the network. As we can see from the cloud below, words like “american”, “propaganda”, “trade”, “war” and “China” are very frequent in posts classified as negative, while words like “Jack Ma”, “founder”, “CEOs” and “bravo” are frequent in positive posts.
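One common way to obtain such positive/negative labels is lexicon-based scoring with tidytext; the sketch below runs on toy tweets and assumes the bundled Bing lexicon rather than whatever classifier the project actually used:

```r
library(dplyr)
library(tidytext)

# toy tweets; real scoring would run over the cleaned tweet text
toy <- tibble(status_id = c("1", "2"),
              text = c("trade war propaganda is terrible",
                       "bravo to the founder, a great success"))

# tokenize, match words against the Bing lexicon, count labels per tweet
scores <- toy %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(status_id, sentiment)
```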
Twitter Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY
7 Social Network Graphs
After modeling the topics, user clusters and sentiment, we were able to put all those pieces together and split the data into sub-networks as shown in the following image:
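Splitting into sub-networks can be sketched with igraph, assuming an edge list shaped like the edge data above plus the modeled topic per edge (the toy data and column names here are assumptions):

```r
library(igraph)

# toy edge list; vertice1/vertice2 follow the naming used in the edge data,
# and `topic` is assumed to carry the modeled topic of the underlying tweet
edge_toy <- data.frame(vertice1 = c("a", "b", "c", "a"),
                       vertice2 = c("b", "c", "a", "c"),
                       topic    = c(1, 1, 2, 2))

g <- graph_from_data_frame(edge_toy, directed = TRUE)

# keep only the edges belonging to topic 1 (and the vertices they touch)
g_topic1 <- subgraph.edges(g, which(edge_toy$topic == 1))
ecount(g_topic1)  # 2 edges in the topic-1 sub-network
```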